Exploring Data Clustering

Approaches to explore: DBSCAN, HDBSCAN, and PCA.

Notes on DBSCAN:

There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as a sample in the dataset such that there exist min_samples other samples within a distance of eps; these are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built recursively: take a core sample, find all of its neighbors that are core samples, find all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of the cluster.

Any core sample is, by definition, part of a cluster, and any cluster has at least min_samples points in it, following the definition of a core sample. Any sample that is not a core sample and is at a distance greater than eps from every core sample is considered an outlier by the algorithm.
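
As a toy sketch of how the two parameters interact (separate from the SAS data): a dense 2-D blob plus one isolated point; the isolated point should come out as noise, which scikit-learn labels -1.

In [ ]:
# Toy DBSCAN sketch (not part of the SAS analysis): a dense 2-D blob plus
# one isolated point that should be labeled as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
blob = rng.normal(0, 0.1, size=(20, 2))  # one dense blob
lone = np.array([[5.0, 5.0]])            # an isolated point
toy = np.vstack([blob, lone])

strict = DBSCAN(eps=0.02, min_samples=5).fit(toy)  # demands very high density
loose = DBSCAN(eps=0.5, min_samples=5).fit(toy)    # demands lower density
print(strict.labels_)  # mostly -1: few points have 5 neighbors within 0.02
print(loose.labels_)   # the blob forms one cluster; the isolated point stays -1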

Consider instead using HDBSCAN (explored below).


In [6]:
import glob
import os
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

In [7]:
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
output_notebook(resources=INLINE)


Loading BokehJS ...

In [8]:
import bokeh.plotting
import bokeh.models   # Legend, Range1d
import bokeh.layouts # gridplot
from bokeh.palettes import Colorblind7 as palette

In [9]:
import datashader
import collections
import xarray
from datashader.bokeh_ext import InteractiveImage

Get the data

Load the SAS data

Begin by designating where the theoretical SAXS and SANS data files are located.


In [10]:
import sys
sys.platform


Out[10]:
'linux'

In [11]:
if sys.platform == 'linux':
    sas_dir = '/home/schowell/data/scratch/sas_clustering/sascalc'
    pr_dir = '/home/schowell/data/scratch/sas_clustering/pr'

elif sys.platform == 'darwin':
    sas_dir = '/Users/schowell/scratch/sas_clustering/sascalc'
    pr_dir = '/Users/schowell/scratch/sas_clustering/pr'

saxs_dir = 'xray'
sans_dir = 'neutron_D2Op_100'
sas_ext = '*.iq'
saxs_search = os.path.join(sas_dir, saxs_dir, sas_ext)
sans_search = os.path.join(sas_dir, sans_dir, sas_ext)
print(saxs_search)
print(sans_search)


pr_ext = '*.pr'
pr_search = os.path.join(pr_dir, pr_ext)
print(pr_search)


/home/schowell/data/scratch/sas_clustering/sascalc/xray/*.iq
/home/schowell/data/scratch/sas_clustering/sascalc/neutron_D2Op_100/*.iq
/home/schowell/data/scratch/sas_clustering/pr/*.pr

In [12]:
saxs_files = glob.glob(saxs_search)
sans_files = glob.glob(sans_search)
pr_files = glob.glob(pr_search)
saxs_files.sort()
sans_files.sort()
pr_files.sort()
n_saxs = len(saxs_files)
n_sans = len(sans_files)
n_pr = len(pr_files)
print('Found {} SAXS data files'.format(n_saxs))
print('Found {} SANS data files'.format(n_sans))
print('Found {} P(r) data files'.format(n_pr))


Found 200 SAXS data files
Found 200 SANS data files
Found 200 P(r) data files

In [13]:
saxs_data = []

# load in the first data set to setup the q-mask
first_data = np.loadtxt(saxs_files[0])
q_mask = first_data[:, 0] <= 0.18  # only use data up to 0.18 1/A
first_data[:, 1] /= first_data[0, 1]  # normalize I(0) to 1
first_data = first_data[q_mask]
saxs_data.append(first_data[1:, 1])  # skip I(0); it is the same for every data set

# load in the rest of the data
for saxs_file in saxs_files[1:]:
    x_data = np.loadtxt(saxs_file)
    x_data[:, 1] /= x_data[0, 1]
    x_data = x_data[q_mask]
    assert np.allclose(x_data[0, 1], first_data[0, 1]), 'ERROR: data not normalized to I(0)'
    assert np.allclose(x_data[:, 0], first_data[:, 0]), 'ERROR: data not on same Q-grid'
    saxs_data.append(x_data[1:, 1])

saxs_data = np.array(saxs_data)

In [14]:
sans_data = []

# load in the first data set to setup the q-mask
first_data = np.loadtxt(sans_files[0])
q_mask = first_data[:, 0] <= 0.18  # only use data up to 0.18 1/A
first_data[:, 1] /= first_data[0, 1]  # normalize I(0) to 1
first_data = first_data[q_mask]
sans_data.append(first_data[1:, 1])  # skip I(0); it is the same for every data set

# load in the rest of the data
for sans_file in sans_files[1:]:
    n_data = np.loadtxt(sans_file)
    n_data[:, 1] /= n_data[0, 1]
    n_data = n_data[q_mask]
    assert np.allclose(n_data[0, 1], first_data[0, 1]), 'ERROR: data not normalized'
    assert np.allclose(n_data[:, 0], first_data[:, 0]), 'ERROR: data not on same Q-grid'
    sans_data.append(n_data[1:, 1])
    
sans_data = np.array(sans_data)

Store the Q-values


In [15]:
q_saxs = x_data[1:, 0]
q_sans = n_data[1:, 0]
print(q_saxs)    
print(q_sans)


[ 0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11  0.12
  0.13  0.14  0.15  0.16  0.17  0.18]
[ 0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11  0.12
  0.13  0.14  0.15  0.16  0.17  0.18]

In [16]:
np.allclose(q_saxs, q_sans)


Out[16]:
True

Load the P(r) data


In [17]:
pr_data_l = []
n_pr = len(pr_files)
n_r = np.empty(n_pr, dtype=int)

# load in all the data sets
for i, pr_file in enumerate(pr_files):
    n_data = np.loadtxt(pr_file, delimiter=',', dtype=int)
    pr_data_l.append(n_data[:, 1])
    n_r[i] = len(n_data)
    
r_max = n_r.max()
pr_data = np.zeros([n_pr, r_max], dtype=int)
for i, n_data in enumerate(pr_data_l):
    pr_data[i, :len(n_data)] = n_data
    

In [18]:
pr_data[:3, :8]


Out[18]:
array([[     0,  17067,  50223,  97762, 154904, 252311, 345445, 436965],
       [     0,  17068,  50248,  97863, 155068, 252631, 345934, 437560],
       [     0,  17068,  50252,  97875, 155072, 252639, 345932, 437629]])

Visualize the data

Create a dataframe for plotting purposes (each column will be a different I(Q) or P(r) curve)


In [19]:
saxs_df = pd.DataFrame(data=saxs_data.T)
saxs_cols = ['c{}'.format(c) for c in saxs_df.columns]
saxs_df.columns = saxs_cols
saxs_df['q'] = q_saxs
saxs_df.head()


Out[19]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 q
0 0.917798 0.917891 0.917723 0.917790 0.916672 0.916286 0.917395 0.917101 0.917916 0.920588 ... 0.929958 0.929471 0.928210 0.929580 0.929479 0.920882 0.918059 0.915479 0.915412 0.01
1 0.711798 0.712118 0.711597 0.711866 0.707277 0.706311 0.709269 0.708664 0.710933 0.718504 ... 0.749546 0.748034 0.744269 0.748370 0.748000 0.719319 0.710378 0.702664 0.702479 0.02
2 0.474336 0.474924 0.474168 0.474622 0.464832 0.464067 0.467067 0.467008 0.469496 0.478496 ... 0.528697 0.526529 0.521513 0.526983 0.526311 0.479227 0.465958 0.455723 0.455479 0.03
3 0.287303 0.288050 0.287303 0.287555 0.273513 0.273714 0.274563 0.275706 0.276546 0.281588 ... 0.335605 0.333647 0.329756 0.334017 0.333134 0.281017 0.268790 0.261378 0.261134 0.04
4 0.179109 0.179857 0.179261 0.178765 0.164588 0.165555 0.164193 0.166034 0.164571 0.163933 ... 0.203950 0.202840 0.201395 0.203008 0.202101 0.161176 0.154756 0.153420 0.153134 0.05

5 rows × 201 columns


In [20]:
sans_df = pd.DataFrame(data=sans_data.T)
sans_cols = ['c{}'.format(c) for c in sans_df.columns]
sans_df.columns = sans_cols
sans_df['q'] = q_sans
sans_df.head()


Out[20]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 q
0 0.919420 0.919513 0.919345 0.919370 0.918168 0.917815 0.918866 0.918630 0.919370 0.921933 ... 0.931681 0.931202 0.929950 0.931303 0.931202 0.922336 0.919496 0.916933 0.916849 0.01
1 0.716874 0.717202 0.716681 0.716840 0.711958 0.711076 0.713891 0.713445 0.715496 0.722782 ... 0.755151 0.753639 0.749891 0.753966 0.753597 0.723899 0.714882 0.707126 0.706891 0.02
2 0.481966 0.482563 0.481807 0.482134 0.471891 0.471176 0.474059 0.474151 0.476361 0.485084 ... 0.537605 0.535420 0.530387 0.535857 0.535176 0.486118 0.472580 0.462151 0.461815 0.03
3 0.295185 0.295966 0.295202 0.295420 0.280815 0.281000 0.281840 0.283017 0.283639 0.288622 ... 0.345370 0.343370 0.339437 0.343731 0.342824 0.287950 0.275244 0.267462 0.267109 0.04
4 0.185849 0.186655 0.186042 0.185639 0.170824 0.171731 0.170420 0.172176 0.170622 0.170092 ... 0.212361 0.211193 0.209739 0.211353 0.210403 0.166605 0.159647 0.157966 0.157597 0.05

5 rows × 201 columns


In [21]:
pr_df = pd.DataFrame(data=pr_data.T)
pr_cols = ['c{}'.format(c) for c in pr_df.columns]
pr_df.columns = pr_cols
r = np.arange(r_max)
pr_df['r'] = r
pr_df.head()


Out[21]:
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ... c191 c192 c193 c194 c195 c196 c197 c198 c199 r
0 0 0 0 0 0 0 1 0 0 0 ... 2 1 1 1 2 0 0 0 0 0
1 17067 17068 17068 17067 17070 17066 17073 17069 17076 17079 ... 17101 17100 17099 17098 17091 17066 17066 17066 17066 1
2 50223 50248 50252 50265 50216 50207 50226 50219 50231 50226 ... 50324 50310 50304 50312 50303 50202 50198 50199 50200 2
3 97762 97863 97875 97883 97600 97613 97605 97608 97604 97633 ... 97901 97878 97836 97866 97828 97547 97542 97555 97547 3
4 154904 155068 155072 155081 154428 154434 154439 154443 154385 154410 ... 154899 154866 154798 154872 154806 154143 154154 154178 154185 4

5 rows × 201 columns


In [22]:
q_range = q_saxs.min(), q_saxs.max()
pq_range = saxs_data.min(), saxs_data.max()
r_range = 0, r_max
pr_range = pr_data.min(), pr_data.max()

In [28]:
%%time
sans_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
sans_aggs = collections.OrderedDict((c, sans_canvas.line(sans_df, 'q', c)) for c in sans_cols)
sans_merged = xarray.concat(sans_aggs.values(), dim=pd.Index(sans_cols, name='cols'))
sans_img = datashader.transfer_functions.shade(sans_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 1min 31s, sys: 892 ms, total: 1min 32s
Wall time: 1min 32s

In [29]:
%%time
saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 880 ms, sys: 24 ms, total: 904 ms
Wall time: 900 ms

In [30]:
%%time
pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, plot_height=300, plot_width=300)
# create an ordered-dictionary of each line, aggregated
pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')


CPU times: user 2min 13s, sys: 516 ms, total: 2min 14s
Wall time: 2min 14s

In [172]:
def saxs_image(q_range, pq_range, w, h):
    saxs_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    saxs_aggs = collections.OrderedDict((c, saxs_canvas.line(saxs_df, 'q', c)) for c in saxs_cols)
    saxs_merged = xarray.concat(saxs_aggs.values(), dim=pd.Index(saxs_cols, name='cols'))
    saxs_img = datashader.transfer_functions.shade(saxs_merged.sum(dim='cols'), how='eq_hist')
    return saxs_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, saxs_image)


Out[172]:

In [173]:
def sans_image(q_range, pq_range, w, h):
    sans_canvas = datashader.Canvas(x_range=q_range, y_range=pq_range, 
                                    plot_height=h, plot_width=w)
    sans_aggs = collections.OrderedDict((c, sans_canvas.line(sans_df, 'q', c)) for c in sans_cols)
    sans_merged = xarray.concat(sans_aggs.values(), dim=pd.Index(sans_cols, name='cols'))
    sans_img = datashader.transfer_functions.shade(sans_merged.sum(dim='cols'), how='eq_hist')
    return sans_img

p = bokeh.plotting.figure(x_range=q_range, y_range=pq_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='Q (1/A)', y_axis_label='P(Q)')
InteractiveImage(p, sans_image)


Out[173]:

In [33]:
def pr_image(r_range, pr_range, w, h):
    pr_canvas = datashader.Canvas(x_range=r_range, y_range=pr_range, 
                                  plot_height=h, plot_width=w)
    pr_aggs = collections.OrderedDict((c, pr_canvas.line(pr_df, 'r', c)) for c in pr_cols)
    pr_merged = xarray.concat(pr_aggs.values(), dim=pd.Index(pr_cols, name='cols'))
    pr_img = datashader.transfer_functions.shade(pr_merged.sum(dim='cols'), how='eq_hist')
    return pr_img

p = bokeh.plotting.figure(x_range=r_range, y_range=pr_range,
                          plot_width=400, plot_height=300, 
                          x_axis_label='r (A)', y_axis_label='P(r)')
InteractiveImage(p, pr_image)


Out[33]:

In [34]:
sans_img


Out[34]:

In [35]:
saxs_img


Out[35]:

In [36]:
pr_img


Out[36]:

Manipulate the data


In [23]:
n_samples, n_features = saxs_data.shape # for PCA, should be (n_samples, n_features)
print('# samples: {}\n# features: {}'.format(n_samples, n_features))


# samples: 200
# features: 18

Each row in the NumPy array is a separate scattering pattern or pair-distance distribution curve. Each column will be a feature we explore.


In [24]:
print(saxs_data[:3, :5])
print(sans_data[:3, :5])
print(pr_data[:3, :5])


[[ 0.91779832  0.71179832  0.47433613  0.28730252  0.17910924]
 [ 0.91789076  0.71211765  0.47492437  0.28805042  0.17985714]
 [ 0.91772269  0.71159664  0.47416807  0.28730252  0.1792605 ]]
[[ 0.91942017  0.71687395  0.48196639  0.29518487  0.18584874]
 [ 0.91951261  0.71720168  0.48256303  0.29596639  0.18665546]
 [ 0.91934454  0.71668067  0.48180672  0.29520168  0.18604202]]
[[     0  17067  50223  97762 154904]
 [     0  17068  50248  97863 155068]
 [     0  17068  50252  97875 155072]]

In [25]:
min_vals = saxs_data.min(axis=0)
max_vals = saxs_data.max(axis=0)
saxs_range = max_vals - min_vals
print(saxs_range)


[ 0.02576471  0.07929412  0.11215126  0.0959916   0.06180672  0.03790756
  0.0335042   0.02414286  0.01343697  0.00860504  0.00817647  0.00815966
  0.00719328  0.00639496  0.00536134  0.00468067  0.00431933  0.00436975]

In [26]:
min_vals = sans_data.min(axis=0)
max_vals = sans_data.max(axis=0)
sans_range = max_vals - min_vals
print(sans_range)


[ 0.02515126  0.07772269  0.1107395   0.0959916   0.0642437   0.03981513
  0.03482353  0.02517647  0.01416807  0.00884874  0.00886555  0.00848739
  0.00810084  0.00784034  0.0067395   0.00586555  0.00514286  0.00516807]

Notice how the variation depends significantly on Q.


In [27]:
min_vals = pr_data.min(axis=0)
max_vals = pr_data.max(axis=0)
pr_range = max_vals - min_vals
print(pr_range)


[      2      48     247     684    1449    2601    4696    7199   10348
   14852   20324   26448   33483   42035   51892   63056   76965   91133
  107671  126696  147179  169099  191703  215318  241866  270704  301275
  332337  363255  393855  423570  454518  480211  505771  528938  554157
  581678  610204  632516  657517  678458  701593  720512  736965  751249
  761955  773427  780324  794875  814249  835276  852393  865353  875525
  887152  893570  895052  892918  891731  887365  889148  882761  882142
  874728  862332  841209  832268  814598  794713  775237  763091  755711
  737816  734890  748581  763052  764722  762752  744522  743587  746469
  744480  734704  744036  753297  758225  759160  765943  794522  814081
  836536  857317  874178  886792  905220  920659  930358  923466  914218
  904550  939833  993162 1029340 1072526 1101818 1130918 1143524 1155555
 1158588 1161896 1156140 1148066 1130239 1109465 1080455 1048508 1016441
  980527  943196  915941  895630  874439  848686  824559  796117  768513
  737693  706352  672474  638126  604399  569085  533381  499646  466126
  433257  401017  371104  340401  310902  283300  256205  230039  204547
  179687  156655  134247  113765   95394   78563   63384   50582   40122
   31690   24280   18434   13555    9796    6520    4508    3062    1961
    1206     668     449     233      97      49      20       3]

In [179]:
range_pq = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Data Range')
range_pr = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Data Range')

range_pq.line(q_saxs, saxs_range, legend='SAXS range', color=palette[0])
range_pq.line(q_sans, sans_range, legend='SANS range', color=palette[1])

range_pr.line(r, pr_range, legend='P(r) range', color=palette[3])
range_pr.legend.location = "bottom_center"

plots = bokeh.layouts.gridplot([[range_pq, range_pr]])
show(plots)


Rescale the data

We originally used StandardScaler but changed to RobustScaler to avoid complications from outliers, which skew the mean and standard deviation; RobustScaler instead centers on the median and scales by the interquartile range.
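
As a minimal sketch of why: on a toy feature with a single outlier, StandardScaler's mean and standard deviation shift noticeably, while RobustScaler's median and interquartile range barely move.

In [ ]:
# Toy comparison of the two scalers on a feature with one large outlier.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.RandomState(0)
feature = np.append(rng.normal(0, 1, 99), 50.0).reshape(-1, 1)  # one outlier

std = StandardScaler().fit(feature)
rob = RobustScaler().fit(feature)
print('StandardScaler center, scale:', std.mean_, std.scale_)    # pulled by the outlier
print('RobustScaler center, scale:  ', rob.center_, rob.scale_)  # nearly unaffected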


In [180]:
from sklearn.preprocessing import StandardScaler, RobustScaler
saxs_scaler = RobustScaler()
sans_scaler = RobustScaler()
pr_scaler = RobustScaler()

In [181]:
saxs_scaler.fit(saxs_data)
sans_scaler.fit(sans_data)
pr_scaler.fit(pr_data)


Out[181]:
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)

In [3]:
import sklearn.preprocessing

In [4]:
scaler = sklearn.preprocessing.RobustScaler

In [182]:
scaled_saxs = saxs_scaler.transform(saxs_data)
scaled_sans = sans_scaler.transform(sans_data)
scaled_pr = pr_scaler.transform(pr_data)
print(scaled_saxs[:3, :5])
print(scaled_sans[:3, :5])
print(scaled_pr[:3, :5])


[[-0.49091402 -0.40799064 -0.32635368 -0.05543155  0.90684171]
 [-0.4789802  -0.39431297 -0.30578209 -0.02232143  0.95177985]
 [-0.50067806 -0.41662917 -0.33223128 -0.05543155  0.91593032]]
[[-0.48020362 -0.40078512 -0.2940716  -0.03060129  0.90528101]
 [-0.46776018 -0.38620432 -0.27385951  0.00268432  0.95179264]
 [-0.49038462 -0.40938405 -0.29948046 -0.02988547  0.91642442]]
[[ 0.         -0.5         1.08235294  4.81218274  5.23509934]
 [ 0.         -0.3         2.25882353  6.86294416  6.32119205]
 [ 0.         -0.3         2.44705882  7.10659898  6.34768212]]

In [183]:
min_vals = scaled_saxs.min(axis=0)
max_vals = scaled_saxs.max(axis=0)
saxs_scaled_range = max_vals - min_vals

min_vals = scaled_sans.min(axis=0)
max_vals = scaled_sans.max(axis=0)
sans_scaled_range = max_vals - min_vals

min_vals = scaled_pr.min(axis=0)
max_vals = scaled_pr.max(axis=0)
pr_scaled_range = max_vals - min_vals

print(saxs_scaled_range)
print(sans_scaled_range)
print(pr_scaled_range)


[ 3.32628153  3.39638261  3.92212181  4.24962798  3.71370866  2.55943262
  2.84531668  3.72512156  3.16947473  5.27155727  6.76869565  5.46272855
  4.29611041  4.51632047  2.93333333  3.23367199  3.85018727  4.3153527 ]
[ 3.38574661  3.45789326  3.75147676  4.08840372  3.70397287  2.6151511
  2.778411    3.70562771  3.14405594  4.47133758  7.08053691  4.73622509
  3.96707819  5.22689076  3.37684211  3.18721461  3.80715397  4.3772242 ]
[   2.            9.6          11.62352941   13.88832487    9.59602649
    7.10655738    7.0696274     5.72143851    4.9036844     4.75720692
    4.67809875    4.41664927    4.06692579    3.93650645    3.85498849
    3.83266726    3.77071468    3.64623156    3.53147037    3.50613662
    3.43829976    3.37267143    3.39069299    3.4764736     3.57107158
    3.69373936    3.77897496    3.88733482    4.04068999    4.15430377
    4.24942564    4.2966205     4.29729591    4.21608464    4.10279919
    4.10102386    4.12418397    4.36057783    4.66456367    5.00752633
    4.87748382    4.91452728    4.83880922    4.59214003    4.46796369
    4.25153025    4.19059668    4.08603823    3.89385972    3.80759298
    3.7769744     3.90241591    3.77046987    3.54606769    3.36517564
    3.18799965    3.12309332    3.06850209    3.01848108    3.00673781
    3.08340096    3.14077171    3.05241031    3.04496285    3.05053466
    3.05540033    3.15253627    3.30358169    3.40978662    3.34119675
    3.33836507    3.3060321     3.19216038    3.32404572    3.14076862
    3.12045229    3.13010733    3.08045717    3.09392132    3.08186183
    2.96431064    2.70891647    2.49534355    2.58952754    2.67475405
    2.65975499    2.7164664     2.67946914    2.66936564    2.75566162
    2.80011113    2.88250325    2.85116539    2.81957409    2.90404077
    3.0017272     3.07760605    3.04786328    2.95533944    3.02555693
    3.31146614    3.5200224     3.65808716    3.78568428    3.97231896
    3.86431545    3.59374698    3.57668933    3.70474602    3.42744205
    3.20880594    3.05728417    3.04380605    2.95124245    2.76822899
    2.60975446    2.55453296    2.43585742    2.42520766    2.37195936
    2.36747657    2.36696467    2.4078521     2.46859355    2.59741145
    2.75227032    2.89919816    2.93103974    2.94866171    3.03083311
    3.13202702    3.23462292    3.3504463     3.54506576    3.64991436
    3.83020086    3.88503308    3.99165322    4.08338282    4.35308943
    4.68496775    5.13133518    5.63327477    6.18018113    6.81917629
    7.61922132    8.56604135    9.64416658   11.4956768    13.87487306
   15.24111812   14.96066253   14.33440514   13.61692985   12.68050659
   11.9895935    11.40993266   11.07518372   10.96258932   11.58868895
   13.7309417    17.50892857   20.61538462   19.94029851   54.42424242
  233.           97.           49.           20.            3.        ]

Notice how rescaling the data using the RobustScaler has removed the $Q$ or $r$ dependence from the ranges.


In [184]:
range_pq0 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Range Before')
range_pr0 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Comparison')
range_pq1 = bokeh.plotting.figure(x_axis_label='Q (1/A)', y_axis_label='P(Q) range',
                                 width=400, height=300, title='Comparison')
range_pr1 = bokeh.plotting.figure(x_axis_label='r (A)', y_axis_label='P(r) range',
                                 width=400, height=300, title='Range After')

range_pq0.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])
range_pq0.line(q_sans, sans_range, legend='SANS before', color=palette[1])
range_pq1.line(q_saxs, saxs_range, legend='SAXS before', color=palette[0])
range_pq1.line(q_sans, sans_range, legend='SANS before', color=palette[1])

range_pq1.line(q_saxs, saxs_scaled_range, legend='SAXS after', color=palette[0], line_dash= [4, 2])
range_pq1.line(q_sans, sans_scaled_range, legend='SANS after', color=palette[1], line_dash= [4, 2])


range_pr0.line(r, pr_range, legend='before', color=palette[3])
range_pr0.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])
range_pr1.line(r, pr_scaled_range, legend='after', color=palette[3], line_dash= [4, 2])

range_pr0.legend.location = "bottom_center"
range_pr1.legend.location = "top_left"

plots = bokeh.layouts.gridplot([[range_pq0, range_pr0],
                                [range_pq1, range_pr1]])
show(plots)


Note that for P(r), the last 5 points in almost every data set are zero, so the range did not change at all, as shown below.


In [48]:
pr_data[:, -5:]


Out[48]:
array([[  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [ 33,  10,   1,   0,   0],
       [ 12,   1,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [ 21,   2,   0,   0,   0],
       [ 15,   1,   0,   0,   0],
       [  9,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [233,  97,  49,  20,   3],
       [ 62,  37,  17,   0,   0],
       [123,  41,  32,   7,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  7,   0,   0,   0,   0],
       [ 25,   2,   0,   0,   0],
       [140,  51,  42,  18,   1],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [  3,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [ 10,   2,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [ 19,   5,   1,   0,   0],
       [ 10,   1,   0,   0,   0],
       [ 10,   1,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [ 11,   0,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 11,   2,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [ 13,   2,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  2,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  2,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  4,   0,   0,   0,   0],
       [  6,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  1,   0,   0,   0,   0],
       [ 14,   3,   0,   0,   0],
       [ 14,   3,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0],
       [  5,   0,   0,   0,   0]])
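
One option (only a sketch, not applied to the analysis here) would be to drop near-constant columns such as these before scaling, since they carry essentially no information for clustering and leave RobustScaler with a degenerate interquartile range.

In [ ]:
# Sketch: drop P(r) columns whose interquartile range is zero before scaling.
q75, q25 = np.percentile(pr_data, [75, 25], axis=0)
keep = (q75 - q25) > 0  # columns with a non-degenerate IQR
pr_trimmed = pr_data[:, keep]
print('kept {} of {} columns'.format(keep.sum(), pr_data.shape[1]))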

In [49]:
import matplotlib.pyplot as plt
%matplotlib inline

In [50]:
def compare_q1_q2(q1, q2):
    plt.figure()
    plt.plot(saxs_data[:,q1], saxs_data[:,q2], 'bo', markersize=5)
    plt.title('original data')
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.figure()
    plt.plot(scaled_saxs[:,q1], scaled_saxs[:,q2], 'bo', markersize=5)
    plt.xlabel(r'$Q_{}$'.format(q1))
    plt.ylabel(r'$Q_{}$'.format(q2))
    plt.title('scaled data')

In [51]:
compare_q1_q2(0, 8)



In [52]:
plt.figure()
plt.plot(saxs_data[:,0], saxs_data[:,17], 'bo', markersize=5)
plt.title('original data')
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(17))
plt.figure()
plt.plot(scaled_saxs[:,0], scaled_saxs[:,17], 'bo', markersize=5)
plt.xlabel(r'$Q_{}$'.format(0))
plt.ylabel(r'$Q_{}$'.format(17))
plt.title('scaled data')


Out[52]:
<matplotlib.text.Text at 0x7f68899b00f0>

In [53]:
scaled_saxs.shape


Out[53]:
(200, 18)

In [54]:
i0 = 2
i_compare = 0
for i0 in range(18):
    plt.figure()
    plt.plot(scaled_saxs[:,i0], scaled_saxs[:, i_compare], 'bo')
    plt.plot(scaled_saxs[112,i0], scaled_saxs[112, i_compare], 'rs')
    plt.plot(scaled_saxs[113,i0], scaled_saxs[113, i_compare], 'gs')
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))


DBSCAN


In [185]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

We need to determine what distance, eps, to use as the neighborhood radius that defines core points. To do this we will use the $k$-dist plot, the sorted distance to the $k^{th}$ nearest neighbor (a scikit-learn cross-check is sketched below):

  1. Plot the sorted $k$-dist
  2. Find the point of greatest curvature, i.e. the largest second derivative
  3. Use that distance as eps
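
As a cross-check of the pairwise-distance computation below, the same sorted $k$-dist curve can be obtained directly from scikit-learn's NearestNeighbors; this is a sketch, taking $k$ equal to min_samples.

In [ ]:
# Sketch: sorted distance to the k-th nearest neighbor via NearestNeighbors.
from sklearn.neighbors import NearestNeighbors

k = 2  # assumed equal to min_samples
nn = NearestNeighbors(n_neighbors=k + 1).fit(scaled_saxs)  # +1: each point returns itself
nn_dist, _ = nn.kneighbors(scaled_saxs)
k_dist_nn = np.sort(nn_dist[:, k])  # column 0 is each point's zero distance to itself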

In [32]:
from scipy.spatial.distance import pdist, cdist

In [131]:
scaled_saxs


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-131-97364101979b> in <module>()
----> 1 scaled_saxs

NameError: name 'scaled_saxs' is not defined

In [132]:
dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
dist[i[:, None] < i] = pdist(scaled_saxs)  # get the distances, consider other distance metrics

i_upper = np.triu_indices(n_samples, 1)
i_lower = np.tril_indices(n_samples, -1)
dist[i_lower] = dist[i_upper]  # make the matrix symmetric

np.fill_diagonal(dist, np.inf)  # fill the diagonal with large value for simplicity


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-132-0fd30e5a2135> in <module>()
      1 dist = np.zeros([n_samples, n_samples])
      2 i = np.arange(n_samples)
----> 3 dist[i[:, None] < i] = pdist(scaled_saxs)  # get the distances, consider other distance metrics
      4 
      5 i_upper = np.triu_indices(n_samples, 1)

NameError: name 'scaled_saxs' is not defined

In [133]:
k_dist = dist.min(axis=1)
k_dist.sort()

In [134]:
k_dist[:10]


Out[134]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [135]:
raw_dist = np.zeros([n_samples, n_samples])
i = np.arange(n_samples)
raw_dist[i[:, None] < i] = pdist(saxs_data)  # get the distances, consider other distance metrics

i_upper = np.triu_indices(n_samples, 1)
i_lower = np.tril_indices(n_samples, -1)
raw_dist[i_lower] = raw_dist.T[i_lower]  # make the matrix symmetric
assert np.allclose(raw_dist.T, raw_dist)

np.fill_diagonal(raw_dist, np.inf)  # fill the diagonal with large value for simplicity

In [136]:
raw_k_dist = raw_dist.min(axis=0)
raw_k_dist.sort()

In [137]:
raw_k_dist.shape


Out[137]:
(200,)

In [138]:
raw_k_dist[:10]


Out[138]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853])

In [139]:
raw_dist[:, 4].min()


Out[139]:
0.0039112840329649104

In [140]:
raw_dist[:5, :5]


Out[140]:
array([[        inf,  0.0014716 ,  0.00104904,  0.00283074,  0.02590177],
       [ 0.0014716 ,         inf,  0.00166125,  0.00340055,  0.02734865],
       [ 0.00104904,  0.00166125,         inf,  0.00257087,  0.02609299],
       [ 0.00283074,  0.00340055,  0.00257087,         inf,  0.02566253],
       [ 0.02590177,  0.02734865,  0.02609299,  0.02566253,         inf]])

In [141]:
raw_dist[4, :].min()


Out[141]:
0.0039112840329649104

In [142]:
raw_dist0 = np.zeros(n_samples)
ind = np.arange(n_samples)
for i in ind:
    dist = cdist(saxs_data[i].reshape([1, -1]), saxs_data[ind!=i])
    raw_dist0[i] = dist.min()

In [143]:
raw_dist0.sort()
raw_dist0[:10]


Out[143]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853])

In [144]:
cdist(saxs_data[4].reshape([1, -1]), saxs_data[ind!=4]).min()


Out[144]:
0.0039112840329649104

In [145]:
raw_k_dist-raw_dist0


Out[145]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])

In [146]:
k_plot = bokeh.plotting.figure(y_axis_label='distance')

k_plot.line(ind, raw_k_dist, color=palette[0], 
            line_dash=[4,2], legend='pdist')
k_plot.line(ind, raw_dist0, color=palette[1], 
            line_dash=[8,4], legend='cdist')

k_plot.legend.location = "top_left"

bokeh.plotting.show(k_plot)



In [148]:
def smooth_diff_n5(x, y):
    # smooth, noise-robust first derivative on a 5-point central stencil
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'
    dy = (2 * (y[3:-1] - y[1:-3]) + y[4:] - y[:-4]) / (8 * h)

    dydx = np.vstack([x[2:-2], dy]).T  # pair each derivative with its x value
    return dydx

In [149]:
def smooth_diff_n7(x, y):
    # smooth, noise-robust first derivative on a 7-point central stencil
    h_all = x[1:] - x[:-1]
    h = h_all[0]
    assert np.allclose(h_all, h), 'inconsistent dx'

    dy = (5 * (y[4:-2] - y[2:-4]) + 4 * (y[5:-1] - y[1:-5]) + y[6:] - y[:-6]) / (32 * h)

    dydx = np.vstack([x[3:-3], dy]).T  # pair each derivative with its x value
    return dydx

In [153]:
ind


Out[153]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199])

In [165]:
raw_k_dist


Out[165]:
array([ 0.00001456,  0.00001456,  0.00009051,  0.00009051,  0.00015495,
        0.00015495,  0.00038965,  0.00038965,  0.00064853,  0.00064853,
        0.00065081,  0.00065081,  0.00067531,  0.00067531,  0.00068687,
        0.00068687,  0.00071902,  0.00071902,  0.00073744,  0.00073744,
        0.00082344,  0.00082344,  0.00085574,  0.00085574,  0.00086444,
        0.00086444,  0.0009241 ,  0.0009241 ,  0.00103314,  0.00103314,
        0.00104904,  0.00104904,  0.00117866,  0.00117866,  0.00119248,
        0.00122225,  0.00127235,  0.00127235,  0.00127382,  0.00127382,
        0.00135341,  0.00135341,  0.00137713,  0.00141955,  0.00141955,
        0.0014716 ,  0.00147581,  0.00147581,  0.00150385,  0.00160462,
        0.00169286,  0.00169286,  0.00188305,  0.00188305,  0.00190914,
        0.00190914,  0.00191853,  0.00204316,  0.00220519,  0.00220519,
        0.00221297,  0.00221297,  0.0023853 ,  0.00252364,  0.00257087,
        0.00259429,  0.00263096,  0.00265477,  0.00265477,  0.00269574,
        0.00269574,  0.00274866,  0.00274866,  0.00280041,  0.00282833,
        0.00283103,  0.00283103,  0.00285891,  0.00285891,  0.00288584,
        0.00296025,  0.00296025,  0.0030044 ,  0.00303831,  0.00306513,
        0.00313283,  0.00313283,  0.00318331,  0.00318331,  0.00319813,
        0.0032523 ,  0.0033773 ,  0.0033773 ,  0.00345248,  0.00348461,
        0.00348461,  0.00349861,  0.00349861,  0.00359851,  0.00364449,
        0.00364449,  0.00366362,  0.00366362,  0.00367506,  0.00372543,
        0.0037328 ,  0.00375183,  0.00375183,  0.00377693,  0.00377693,
        0.00378119,  0.00378119,  0.00379734,  0.00388158,  0.00388158,
        0.00391128,  0.00391128,  0.00392403,  0.00393383,  0.00394925,
        0.00399038,  0.00399038,  0.00405248,  0.00414293,  0.00415781,
        0.00429495,  0.00434333,  0.00434333,  0.00436079,  0.00436079,
        0.00442148,  0.00443727,  0.00443727,  0.00447021,  0.00453043,
        0.0045865 ,  0.00461872,  0.00470169,  0.00477683,  0.00478941,
        0.00478941,  0.0048848 ,  0.00489099,  0.00489661,  0.00491209,
        0.00494634,  0.00500364,  0.00500364,  0.00502461,  0.00504536,
        0.00504536,  0.00505301,  0.00507093,  0.00508292,  0.00508292,
        0.00511697,  0.00529451,  0.0054369 ,  0.0054369 ,  0.00557138,
        0.00561624,  0.00572227,  0.00584073,  0.0059103 ,  0.00599123,
        0.00621831,  0.00621894,  0.0063196 ,  0.00633555,  0.00644566,
        0.00654345,  0.00655973,  0.00655973,  0.00667972,  0.00675611,
        0.00686773,  0.00693923,  0.00711057,  0.00721525,  0.00721525,
        0.00724698,  0.00724698,  0.00742144,  0.00808267,  0.0082362 ,
        0.0084389 ,  0.00865306,  0.00865306,  0.00889643,  0.00948668,
        0.00950688,  0.00971971,  0.01017545,  0.01043254,  0.01155862,
        0.01228613,  0.01228613,  0.0125944 ,  0.01294705,  0.01950195])

In [170]:
i2


Out[170]:
array([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,
        26,  28,  30,  32,  34,  36,  38,  40,  42,  44,  46,  48,  50,
        52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102,
       104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128,
       130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154,
       156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180,
       182, 184, 186, 188, 190, 192, 194, 196, 198])

In [171]:
ind[::2]


Out[171]:
array([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,
        26,  28,  30,  32,  34,  36,  38,  40,  42,  44,  46,  48,  50,
        52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100, 102,
       104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128,
       130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154,
       156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180,
       182, 184, 186, 188, 190, 192, 194, 196, 198])

In [ ]:
tip_to_tail = raw_k_dist[-1] - raw_k_dist[0]  # rise of the sorted k-dist curve; see the sketch below
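
The chord-based ("tip-to-tail") heuristic this cell starts on can be finished as follows: normalize both axes of the sorted $k$-dist curve, draw the chord from its first to its last point, and take the elbow as the point farthest, perpendicularly, from that chord. A sketch, assuming ind and raw_k_dist from above:

In [ ]:
# Sketch of the tip-to-tail elbow heuristic for choosing eps.
xy = np.column_stack([ind / ind.max(), raw_k_dist / raw_k_dist.max()])  # scale-free axes
chord = xy[-1] - xy[0]
chord = chord / np.linalg.norm(chord)  # unit vector from first to last point
rel = xy - xy[0]
perp = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])  # |2-D cross product|
i_elbow = perp.argmax()
print('elbow index: {}, candidate eps: {}'.format(i_elbow, raw_k_dist[i_elbow]))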

In [175]:
k_plot = bokeh.plotting.figure(y_axis_label='distance')

# k_plot.line(i, k_dist/k_dist.max(), color=palette[0], legend='scaled')
k_plot.line(ind[::2], raw_k_dist[::2]/raw_k_dist.max(), color=palette[0], 
            line_dash=[4,2], legend='raw')

n = 1
i2 = ind[::2]
exponents = np.polyfit(i2[:75], raw_k_dist[:150:2]/raw_k_dist.max(), n)
fit = np.poly1d(exponents)
y = fit(ind[::2])
k_plot.line(i2, y, color=palette[1], legend='fit ({})'.format(n))

d1_norm = smooth_diff_n5(ind[::2], raw_k_dist[::2]/raw_k_dist.max())
# k_plot.line(d1_norm[:, 0], d1_norm[:, 1], color=palette[2], line_dash=[4,2], legend='d1')

d1_fit = np.empty([n_samples//2-1, 2])
d1_fit[:, 0] = i2[:-1]
d1_fit[:, 1] = np.diff(y)/np.diff(i2)
k_plot.line(d1_fit[:, 0], d1_fit[:, 1], color=palette[3], legend='d1 fit')

d2 = smooth_diff_n5(d1_norm[:, 0], d1_norm[:, 1])
k_plot.line(d2[:, 0], d2[:, 1], color=palette[4], line_dash=[4,2], legend='d2')

d2_fit = np.empty([n_samples//2-2, 2])
d2_fit[:, 0] = i2[1:-1]
d2_fit[:, 1] = np.diff(d1_fit[:, 1])/np.diff(d1_fit[:, 0])
k_plot.line(d2_fit[:, 0], d2_fit[:, 1], color=palette[5], legend='d2 fit')

d3 = smooth_diff_n5(d2[:, 0], d2[:, 1])
k_plot.line(d3[:, 0], d3[:, 1], color=palette[6], line_dash=[4,2], legend='d3')

d3_fit = smooth_diff_n5(d2_fit[:, 0], d2_fit[:, 1])
k_plot.line(d3_fit[:, 0], d3_fit[:, 1], color=palette[1], legend='d3 fit')

#k_plot.line(i[1:-1], d2_dist, color=palette[0], line_dash=[5,2], legend='scaled d2')
# k_plot.line(i[1:-1], d2_k_dist, color=palette[2], line_dash=[10,1], legend='scaled np d2')

# k_plot.line(i[1:], d_raw_k_dist, color=palette[3], line_dash=[10,1], legend='raw d')
# k_plot.line(i[1:-1], d2_raw_dist, color=palette[1], line_dash=[5,2], legend='raw d2')
              
k_plot.legend.location = "top_left"

bokeh.plotting.show(k_plot)



In [151]:
y


Out[151]:
0.92656301440735367

Find the elbow, or point of maximum curvature


In [231]:
x = np.arange(10)

In [238]:
xx = np.vstack([x,x]).T

In [249]:
d1_norm = smooth_diff_n5(ind, raw_k_dist/raw_k_dist.max())

In [253]:
d1_raw = smooth_diff_n5(ind, raw_k_dist)

In [273]:
x = np.linspace(0, 2*np.pi)
y = np.sin(x)
dydx = smooth_diff_n5(x, y)
d2ydx2 = smooth_diff_n5(dydx[:, 0], dydx[:, 1])

p = bokeh.plotting.figure()
p.line(x, y, legend='sin(x)', color=palette[0])
p.line(dydx[:, 0], dydx[:, 1], legend='cos(x)', color=palette[1])
p.line(d2ydx2[:, 0], d2ydx2[:, 1], legend='-sin(x)', color=palette[2])

bokeh.plotting.show(p)



In [303]:
# Compute DBSCAN
## Tune these parameters to adjust cluster size ##
distance = 1  # <-- needs tuning; see the k-dist analysis above
min_samples = 2
##################################################
x_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_saxs)
x_core_samples_mask = np.zeros_like(x_db.labels_, dtype=bool)
x_core_samples_mask[x_db.core_sample_indices_] = True
saxs_labels = x_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_sans)
n_core_samples_mask = np.zeros_like(n_db.labels_, dtype=bool)
n_core_samples_mask[n_db.core_sample_indices_] = True
sans_labels = n_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = DBSCAN(eps=distance, min_samples=min_samples).fit(scaled_pr)
p_core_samples_mask = np.zeros_like(p_db.labels_, dtype=bool)
p_core_samples_mask[p_db.core_sample_indices_] = True
pr_labels = p_db.labels_ + 1  # shift so noise (-1) becomes 0, i.e. independent structures
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)

In [267]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39}
unique clusters: 86
0: 47
1: 4
2: 2
3: 2
4: 6
5: 2
6: 2
7: 11
8: 6
9: 2
10: 5
11: 2
12: 2
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 5
20: 5
21: 18
22: 2
23: 2
24: 2
25: 2
26: 2
27: 2
28: 18
29: 4
30: 2
31: 2
32: 3
33: 2
34: 11
35: 2
36: 2
37: 2
38: 3
39: 4

In [268]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39}
unique clusters: 86
0: 47
1: 4
2: 2
3: 2
4: 6
5: 2
6: 2
7: 11
8: 6
9: 2
10: 5
11: 2
12: 2
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 5
20: 5
21: 18
22: 2
23: 2
24: 2
25: 2
26: 2
27: 2
28: 18
29: 4
30: 2
31: 2
32: 3
33: 2
34: 11
35: 2
36: 2
37: 2
38: 3
39: 4

In [269]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23}
unique clusters: 166
0: 143
1: 2
2: 2
3: 2
4: 3
5: 2
6: 2
7: 2
8: 2
9: 2
10: 2
11: 5
12: 2
13: 2
14: 4
15: 2
16: 2
17: 2
18: 3
19: 5
20: 3
21: 2
22: 2
23: 2

In [ ]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [ ]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)

In [ ]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))

In [298]:
import hdbscan

In [326]:
# Compute HDBSCAN
## Tune these parameters to adjust cluster size ##
min_samples = 2
##################################################

x_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_saxs)

In [323]:
x_db


Out[323]:
HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, core_dist_n_jobs=4, gen_min_span_tree=False,
    leaf_size=40, match_reference_implementation=False,
    memory=Memory(cachedir=None), metric='euclidean', min_cluster_size=2,
    min_samples=None, p=None)

In [327]:
saxs_labels = x_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
x_clusters_ = len(set(saxs_labels)) - (1 if 0 in saxs_labels else 0)

n_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_sans)
sans_labels = n_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
n_clusters_ = len(set(sans_labels)) - (1 if 0 in sans_labels else 0)

p_db = hdbscan.HDBSCAN(min_cluster_size=min_samples).fit_predict(scaled_pr)
pr_labels = p_db + 1  # shift so noise (-1) becomes 0, i.e. independent structures
p_clusters_ = len(set(pr_labels)) - (1 if 0 in pr_labels else 0)

In [328]:
# x-ray clusters
x_unique = set(saxs_labels)
x_unique.remove(0)
print('cluster labels: {}'.format(x_unique))
print('unique clusters: {}'.format(len(x_unique) + list(saxs_labels).count(0)))
for c in set(saxs_labels):
    print('{}: {}'.format(c, list(saxs_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32}
unique clusters: 96
0: 64
1: 10
2: 3
3: 4
4: 3
5: 4
6: 3
7: 5
8: 4
9: 5
10: 4
11: 4
12: 2
13: 6
14: 2
15: 5
16: 3
17: 2
18: 2
19: 3
20: 5
21: 8
22: 6
23: 7
24: 3
25: 5
26: 4
27: 4
28: 3
29: 5
30: 3
31: 5
32: 4

In [329]:
# neutron clusters
unique = set(sans_labels)
unique.remove(0)
total_clusters = len(unique) + list(sans_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(sans_labels):
    print('{}: {}'.format(c, list(sans_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
unique clusters: 88
0: 57
1: 10
2: 2
3: 3
4: 4
5: 4
6: 5
7: 5
8: 5
9: 3
10: 5
11: 4
12: 4
13: 4
14: 3
15: 6
16: 6
17: 2
18: 7
19: 5
20: 5
21: 5
22: 4
23: 2
24: 5
25: 8
26: 3
27: 3
28: 5
29: 7
30: 5
31: 4

In [330]:
# pr clusters
unique = set(pr_labels)
unique.remove(0)
total_clusters = len(unique) + list(pr_labels).count(0)
print('cluster labels: {}'.format(unique))
print('unique clusters: {}'.format(total_clusters))
for c in set(pr_labels):
    print('{}: {}'.format(c, list(pr_labels).count(c)))


cluster labels: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57}
unique clusters: 85
0: 28
1: 2
2: 6
3: 4
4: 3
5: 2
6: 4
7: 5
8: 3
9: 2
10: 3
11: 4
12: 3
13: 2
14: 2
15: 2
16: 2
17: 2
18: 2
19: 4
20: 3
21: 4
22: 3
23: 2
24: 2
25: 2
26: 3
27: 6
28: 2
29: 4
30: 3
31: 2
32: 2
33: 2
34: 2
35: 2
36: 2
37: 3
38: 2
39: 2
40: 2
41: 3
42: 7
43: 3
44: 2
45: 3
46: 6
47: 5
48: 4
49: 6
50: 3
51: 2
52: 3
53: 3
54: 2
55: 3
56: 3
57: 2

In [331]:
np.savetxt('saxs_clusters.txt', saxs_labels, fmt='%d')
np.savetxt('sans_clusters.txt', sans_labels, fmt='%d')
np.savetxt('pr_clusters.txt', pr_labels, fmt='%d')

In [332]:
print(sans_labels)
print(sans_labels.shape)
slabels = np.array(sans_labels, dtype='str')
# print(slabels)
# print(slabels.shape)


[11 11 11 11  0 17 14 17  0 14 14 15  0 15  0  0 15 15 15 15  1  1  1  1  1
  1 13  0 13 13 13  0 22 22 18 18 18 18 18 18  0  0  0  0 20 21 21 21  0 21
 20  0  0  0 22 20 20 20 21  2  0  2 16  0 12 16 16  0 19 16  1  1  1  1 22
  0  0  0  0  0 12  0  0  7  7  7  7  7  8  8  8  0 16 16 19 19 19 19  0  0
  0 18  0  0 25 25 25 25 12 12 25 25 24 24 24 24  0 24  0 25 25  0  0  0  0
  0  0 29 29 30 31 29 29 31 31 30 30 30 30 10 10  5 10 10 10  5  5  0  5  0
  0  0 23  0 27 27 28 28 28 26 26 26 28 27 28 23  0  0  0  0  0 29 31 29  0
  0  9  9  9  0  8  8  0  6  6  6  6  6  0  3  3  4  4  3  4  4 29  0  0  0]
(200,)

In [333]:
from matplotlib import offsetbox
i_compare = 0

mn = scaled_saxs.min(axis=0)
mx = scaled_saxs.max(axis=0)

# for i0 in range(1):
for i0 in range(18):
    plt.figure()
    
    # plot points to make the correct box size
    plt.plot(mn[i0], mn[i_compare], 'w.')
    plt.plot(mx[i0], mx[i_compare], 'w.')
    
    for j in range(len(scaled_saxs)):
        if slabels[j] != '0':
            plt.text(scaled_saxs[j, i0], scaled_saxs[j, i_compare], slabels[j],
                     fontdict={'weight': 'bold', 'size': 15}, 
                     color='r') # plt.cm.Set1(labels[i]/10.0))
        else:
            plt.plot(scaled_saxs[j, i0], scaled_saxs[j, i_compare], 'k.',
                    markersize=5)
                
    plt.xlabel(r'$Q_{}$'.format(i0))
    plt.ylabel(r'$Q_{}$'.format(i_compare))


Write DCD output


In [ ]:
import sasmol.sasmol as sasmol

dcd_fname = glob.glob('*.dcd')
assert len(dcd_fname) == 1, 'ERROR: unsure which dcd file to use: {}'.format(dcd_fname)
dcd_fname = dcd_fname[0]

pdb_fname = glob.glob('*.pdb')
assert len(pdb_fname) == 1, 'ERROR: unsure which pdb file to use: {}'.format(pdb_fname)
pdb_fname = pdb_fname[0]

In [ ]:
mol = sasmol.SasMol(0)
mol.read_pdb(pdb_fname)

In [ ]:
if not np.alltrue(sans_labels == saxs_labels):
    print('WARNING: labels do not match\nusing neutron labels')
labels = sans_labels

In [ ]:
dcd_fname

In [ ]:
# create a dcd for every cluster with >1 frame
dcd_fnames = []
cluster_out_files = [] # dcds for clusters
unique_out_fname = '{}_unique.dcd'.format(dcd_fname[:-4]) 
dcd_out_file = mol.open_dcd_write(unique_out_fname) # dcd file for unique structures

dcd_in_file = mol.open_dcd_read(dcd_fname)

unique = set(labels) - {0}  # cluster labels; 0 marks independent structures
for i in range(len(unique)):
    dcd_fnames.append('{}_c{:02d}.dcd'.format(dcd_fname[:-4], i))
    cluster_out_files.append(mol.open_dcd_write(dcd_fnames[i]))

visited_cluster = set()
dcd_out_frame = 0
cluster_out_frame = np.zeros(len(unique), dtype=int)

for (i, label) in enumerate(labels):
    mol.read_dcd_step(dcd_in_file, i)
    if label == 0:
        dcd_out_frame += 1
        mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
    else:
        cluster_out_frame[label-1] += 1
        # print('adding frame to cluster {}'.format(label-1))
        # print(cluster_out_frame)
        mol.write_dcd_step(cluster_out_files[label-1], 0, cluster_out_frame[label-1])
        if label not in visited_cluster:
            visited_cluster.add(label)
            dcd_out_frame += 1
            mol.write_dcd_step(dcd_out_file, 0, dcd_out_frame)
        
for cluster_out_file in cluster_out_files:
    mol.close_dcd_write(cluster_out_file)

mol.close_dcd_write(dcd_out_file)    
mol.close_dcd_read(dcd_in_file[0])

PCA Analysis


In [ ]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [ ]:
pca_orig = PCA()
pca_orig.fit(saxs_data)

In [ ]:
pca_scaled = PCA()
pca_scaled.fit(scaled_saxs)

In [ ]:
print(pca_orig.explained_variance_ratio_)
print(pca_scaled.explained_variance_ratio_)

In [ ]:
plt.figure()
plt.plot(q_saxs, pca_orig.explained_variance_ratio_, 'o', label='unscaled')
plt.plot(q_saxs, pca_scaled.explained_variance_ratio_, 's', label='scaled')
plt.legend()
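
As a follow-up sketch, the cumulative explained variance gives the number of components needed to reach a target such as 99%.

In [ ]:
# Sketch: cumulative explained variance for the scaled SAXS data.
cum_var = np.cumsum(pca_scaled.explained_variance_ratio_)
n_99 = int(np.searchsorted(cum_var, 0.99)) + 1
print('{} components explain 99% of the variance'.format(n_99))

plt.figure()
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, 'o-')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')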

In [ ]:
from sklearn.datasets.samples_generator import make_blobs

In [ ]:
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

In [ ]:
X_scaled = StandardScaler().fit_transform(X)

In [ ]:
X_range = X.max(axis=0) - X.min(axis=0)
print(X_range)

In [ ]:
X_scaled_range = X_scaled.max(axis=0) - X_scaled.min(axis=0)
print(X_scaled_range)

In [ ]:
X_s2 = StandardScaler().fit_transform(X)

In [ ]:
X_s2_range = X_s2.max(axis=0) - X_s2.min(axis=0)
print(X_s2_range)
